Introduction

This analysis examines simulated income data for Hungary, focusing on the relationships between income and various demographic factors including age, location, occupation, and gender. The dataset is simulated to reflect realistic patterns while maintaining a manageable size for analysis.

Data Simulation

The data was simulated by the data_simulation.R script. The data is available in the hungarian_income_data.csv file.

Important things to note about the generated data:

  • Only the 8 most populated cities of Hungary are taken into count weighted by their population. List of cities: Budapest, Debrecen, Szeged, Miskolc, Pécs, GyÅ‘r, Szombathely, Eger.

  • Only the 10 most common occupations are taken into count weighted by their frequency in the workforce. List of occupations: Software Developer, Teacher, Doctor, Sales Representative, Engineer, Accountant, Nurse, Manager, Chef, Driver.

  • The age distribution is generated by a beta distribution with parameters \(\alpha = 2\) and \(\beta = 3\) and multiplied by 95 to put the end result in the desired range. The beta distribution with the aforementioned parameters skews the age distribution towards younger ages, which is more realistic.

  • There are three groups of people categorized by their age:

    • Underage: 0-18

    • Working age: 18-65

    • Pension age: each person has a random retirement age between 60 and 75.

  • Under 18 people have no income.

  • Working age people have a regular income based on their age, occupation, city, and gender.

  • Pension age people have a pension based on their occupation and city.

  • All working age people are considered to be employed.

  • The income of a working age man is 20.000 HUF higher than the income of a working age woman in the same occupation, city, and age group.

Descriptive Statistics

summary_stats <- summary(data)
kable(summary_stats, caption = "Summary Statistics of the Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Summary Statistics of the Dataset
age city occupation gender income retirement_age
Min. : 0.00 Length:10000 Length:10000 Length:10000 Min. :-40610 Min. :60.00
1st Qu.:23.00 Class :character Class :character Class :character 1st Qu.: 97621 1st Qu.:64.00
Median :36.00 Mode :character Mode :character Mode :character Median :321771 Median :68.00
Mean :37.87 NA NA NA Mean :301918 Mean :67.57
3rd Qu.:51.00 NA NA NA 3rd Qu.:472848 3rd Qu.:72.00
Max. :94.00 NA NA NA Max. :850546 Max. :75.00
cat("\nCity Distribution:\n")
## 
## City Distribution:
print(table(data$city))
## 
##    Budapest    Debrecen        Eger        Győr     Miskolc        Pécs 
##        3468        1543         658         806         974         853 
##      Szeged Szombathely 
##        1160         538
cat("\nOccupation Distribution:\n")
## 
## Occupation Distribution:
print(table(data$occupation))
## 
##           Accountant                 Chef               Doctor 
##                 1190                  507                  492 
##               Driver             Engineer              Manager 
##                  533                  980                  775 
##                Nurse Sales Representative   Software Developer 
##                 1209                 2063                  772 
##              Teacher 
##                 1479
cat("\nAge Distribution Summary:\n")
## 
## Age Distribution Summary:
print(summary(data$age))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   23.00   36.00   37.87   51.00   94.00
cat("\nRetirement Age Distribution:\n")
## 
## Retirement Age Distribution:
print(table(data$retirement_age))
## 
##  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75 
## 611 631 602 655 627 583 627 603 639 605 597 645 609 629 625 712
cat("\nAge Group Distribution:\n")
## 
## Age Group Distribution:
print(table(cut(data$age, breaks = c(0, 15, 65, Inf), labels = c("0-14", "15-64", "65+"))))
## 
##  0-14 15-64   65+ 
##  1305  7774   920
cat("\nIncome Distribution Summary:\n")
## 
## Income Distribution Summary:
print(summary(data$income))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -40610   97621  321771  301918  472848  850546
cat("\nIncome Distribution by Age Group:\n")
## 
## Income Distribution by Age Group:
print(tapply(data$income, cut(data$age, breaks = c(0, 18, 65, Inf), labels = c("Under 18", "Working Age", "Pension Age")), summary)) 
## $`Under 18`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -40610       0       0   12366       0  314097 
## 
## $`Working Age`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -21998  272896  392635  384598  498549  815483 
## 
## $`Pension Age`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3800   40963   71470  208533  504249  850546

Demographics Analysis

library(forcats)

data <- data %>%
  mutate(age_group = cut(age, 
                        breaks = seq(0, 100, by = 2), 
                        right = FALSE, 
                        include.lowest = TRUE, 
                        labels = seq(0, 98, by = 2)))

dem_pyramid <- data %>%
  group_by(age_group, gender) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(count = ifelse(gender == "Male", -count, count))

ggplot(dem_pyramid, aes(x = age_group, y = count, fill = gender)) +
  geom_bar(stat = "identity", width = 0.8, color = "black") +
  scale_y_continuous(labels = abs, expand = expansion(mult = c(0.05, 0.05))) +
  scale_fill_manual(values = c("Male" = "#00BFFF", "Female" = "#FF3B3B")) +
  coord_flip() +
  labs(title = "Population Pyramid of Simulated Hungarian Data",
       x = "Age Group",
       y = "Count",
       fill = "Gender") +
  custom_theme +
  theme(legend.position = "top",
        axis.text.y = element_text(size = 10, face = "bold"),
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20))

ggplot(data %>% filter(age >= 18), aes(x = income, fill = gender)) +
  geom_density(alpha = 0.6) +
  scale_fill_viridis_d() +
  labs(title = "Income Distribution by Gender",
       subtitle = "(working age only)",
       x = "Income (HUF)",
       y = "Density") +
  custom_theme

ggplot(data, aes(x = age, y = income, color = gender)) +
  geom_point(alpha = 0.1, width = 0.2) +
  scale_color_viridis_d() +
  labs(title = "Income Distribution by Age",
       subtitle = "(working age only)",
       x = "Age",
       y = "Income (HUF)") +
  custom_theme

ggplot(data %>% filter(age >= 18), aes(x = reorder(occupation, income, FUN = median), y = income, color = occupation)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(alpha = 0.1, width = 0.2) +
  scale_color_viridis_d() +
  coord_flip() +
  labs(title = "Income Distribution by Occupation",
       subtitle = "(working age only)",
       x = "Occupation",
       y = "Income (HUF)") +
  custom_theme

ggplot(data %>% filter(age >= 18), aes(x = reorder(city, income, FUN = median), y = income, fill = city)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, alpha = 0.5) +
  scale_fill_viridis_d() +
  coord_flip() +
  labs(title = "Income Distribution by City",
       subtitle = "(working age only)",
       x = "City",
       y = "Income (HUF)") +
  custom_theme

ggplot(data %>% filter(age >= 18), aes(x = age, y = income, color = gender)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = TRUE) +
  scale_color_viridis_d() +
  labs(title = "Relationship between Age and Income",
       subtitle = "(working age only)",
       x = "Age",
       y = "Income (HUF)") +
  custom_theme

income_by_category <- data %>%
  filter(age >= 18) %>%
  group_by(occupation, city, gender) %>%
  summarise(
    mean_income = mean(income),
    count = n(),
    .groups = 'drop'
  ) %>%
  arrange(desc(mean_income))

# heatmap
ggplot(income_by_category, aes(x = city, y = occupation, fill = mean_income)) +
  geom_tile() +
  scale_fill_viridis(name = "Mean Income (HUF)") +
  facet_wrap(~gender) +
  labs(title = "Mean Income by Occupation, City, and Gender",
       subtitle = "(working age only)",
       x = "City",
       y = "Occupation") +
  custom_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text = element_text(face = "bold"))

top_earners <- income_by_category %>%
  arrange(desc(mean_income)) %>%
  head(10)

kable(top_earners, 
      caption = "Top 10 Highest Earning Combinations",
      digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE)
Top 10 Highest Earning Combinations
occupation city gender mean_income count
Software Developer Budapest Female 520884 104
Software Developer Budapest Male 509372 113
Doctor Budapest Male 508114 70
Software Developer Debrecen Male 507967 45
Software Developer Szeged Female 505631 46
Doctor Budapest Female 494593 91
Manager Szeged Male 475521 32
Manager Budapest Male 471953 122
Software Developer Szeged Male 454736 39
Manager Budapest Female 452448 103

Hypothesis Testing

Parametric Tests

1. Gender Income Difference (t-test)

# Test if there's a significant difference in income between genders
t_test_result <- t.test(income ~ gender, data = data)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  income by gender
## t = -4.7355, df = 9997.1, p-value = 2.215e-06
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -28031.96 -11618.99
## sample estimates:
## mean in group Female   mean in group Male 
##             291932.1             311757.5

2. City Income Differences (ANOVA)

# Test if there are significant differences in income between cities
city_anova <- aov(income ~ city, data = data)
print(summary(city_anova))
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## city           7 1.189e+13 1.699e+12   39.73 <2e-16 ***
## Residuals   9992 4.273e+14 4.277e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. Occupation Income Differences (ANOVA)

# Test if there are significant differences in income between occupations
occupation_anova <- aov(income ~ occupation, data = data)
print(summary(occupation_anova))
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## occupation     9 1.529e+13 1.699e+12   40.03 <2e-16 ***
## Residuals   9990 4.239e+14 4.243e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Non-parametric Tests

1. Gender Income Distribution (Kolmogorov-Smirnov Test)

# Test if income distributions differ between genders
male_income <- data$income[data$gender == "Male"]
female_income <- data$income[data$gender == "Female"]
ks_test <- ks.test(male_income, female_income)
print(ks_test)
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  male_income and female_income
## D = 0.057746, p-value = 1.149e-07
## alternative hypothesis: two-sided

2. Income Distribution by City (Kruskal-Wallis Test)

# Test if income distributions differ between cities
kruskal_test <- kruskal.test(income ~ city, data = data)
print(kruskal_test)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  income by city
## Kruskal-Wallis chi-squared = 264.96, df = 7, p-value < 2.2e-16

Regression Analysis

Multiple Linear Regression

# Convert categorical variables to factors
data_working_age <- data %>% filter(age >= 18)
data_working_age$city <- as.factor(data_working_age$city)
data_working_age$occupation <- as.factor(data_working_age$occupation)
data_working_age$gender <- as.factor(data_working_age$gender)

# Fit the model
model1 <- lm(income ~ age + I(age^2) + city + occupation + gender, data = data_working_age)

# Create a beautiful model summary table
model_summary <- summary(model1)
kable(tidy(model_summary), caption = "Multiple Linear Regression Results (Working Age Only)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Multiple Linear Regression Results (Working Age Only)
term estimate std.error statistic p.value
(Intercept) -467868.527 9309.945146 -50.254703 0.0000000
age 39661.233 386.234283 102.686982 0.0000000
I(age^2) -394.207 4.035882 -97.675537 0.0000000
cityDebrecen -42913.462 3640.947447 -11.786345 0.0000000
cityEger -91510.468 5027.889908 -18.200571 0.0000000
cityGyőr -97073.820 4673.807495 -20.769752 0.0000000
cityMiskolc -94477.112 4361.810594 -21.660068 0.0000000
cityPécs -91640.043 4576.647811 -20.023398 0.0000000
citySzeged -42170.411 4098.279511 -10.289784 0.0000000
citySzombathely -88470.188 5558.759739 -15.915455 0.0000000
occupationChef -41621.955 6327.094928 -6.578367 0.0000000
occupationDoctor 71151.327 6392.755614 11.129993 0.0000000
occupationDriver -49173.742 6302.717112 -7.801991 0.0000000
occupationEngineer 16834.564 5144.748515 3.272184 0.0010715
occupationManager 33771.712 5539.603923 6.096413 0.0000000
occupationNurse -27907.625 4867.348917 -5.733640 0.0000000
occupationSales Representative -57131.282 4331.087143 -13.190980 0.0000000
occupationSoftware Developer 86196.366 5512.034927 15.637848 0.0000000
occupationTeacher -14876.721 4641.740538 -3.204988 0.0013558
genderMale 23252.543 2393.194219 9.716112 0.0000000
# Enhanced model diagnostics
par(mfrow = c(2,2))
plot(model1, col = income_palette[1], pch = 19, cex = 0.7)

Polynomial Regression for Age-Income Relationship

# Fit polynomial regression
model2 <- lm(income ~ poly(age, 3), data = data_working_age)

# Create a static plot
ggplot(data_working_age, aes(x = age, y = income)) +
  geom_point(alpha = 0.1, color = income_palette[1]) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), 
              color = income_palette[5], fill = income_palette[5], alpha = 0.2) +
  labs(title = "Polynomial Regression: Age vs Income",
       subtitle = "Cubic polynomial fit with confidence interval (Working Age Only)",
       x = "Age",
       y = "Income (HUF)") +
  custom_theme

# Model comparison
model_comparison <- data.frame(
  Model = c("Multiple Linear", "Polynomial"),
  R_squared = c(summary(model1)$r.squared, summary(model2)$r.squared),
  Adj_R_squared = c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared)
)

kable(model_comparison, caption = "Model Comparison (Working Age Only)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Model Comparison (Working Age Only)
Model R_squared Adj_R_squared
Multiple Linear 0.6184562 0.6175882
Polynomial 0.5210928 0.5209211

Predictions

# Create a sample prediction
new_data <- data.frame(
  age = 35,
  city = "Budapest",
  occupation = "Software Developer",
  gender = "Male"
)

# Predict income
prediction <- predict(model1, newdata = new_data, interval = "prediction")
print(prediction)
##      fit      lwr      upr
## 1 546820 332175.4 761464.5

Summary and Conclusions

The analysis of the simulated Hungarian income data reveals several interesting patterns:

  1. There is a significant gender pay gap, with males earning more on average than females.
  2. Income varies significantly across different cities, with Budapest showing the highest average income.
  3. Occupation has a strong effect on income, with software developers and doctors earning the most.
  4. Age shows a non-linear relationship with income, peaking in the middle age range.
  5. The regression models explain a significant portion of the income variation.

The analysis demonstrates the complex interplay between various factors affecting income levels in Hungary. The findings suggest that both demographic factors (age, gender) and professional factors (occupation, location) play important roles in determining income levels.